In Press

Authors

  • Steven J. Luck
  • Nicholas Gaspelin
Abstract

Event-related potential (ERP) experiments generate massive data sets, often containing thousands of values for each participant, even after averaging. The richness of these data sets can be very useful in testing sophisticated hypotheses, but this richness also creates many opportunities to obtain effects that are statistically significant but do not reflect true differences among groups or conditions (bogus effects). The purpose of this paper is to demonstrate how common and seemingly innocuous methods for quantifying and analyzing ERP effects can lead to very high rates of significant-but-bogus effects, with the likelihood of obtaining at least one such bogus effect exceeding 50% in many experiments. We focus on two specific problems: using the grand average data to select the time windows and electrode sites for quantifying component amplitudes and latencies, and using one or more multi-factor statistical analyses. Reanalyses of prior data and simulations of typical experimental designs are used to show how these problems can greatly increase the likelihood of significant-but-bogus results. Several strategies are described for avoiding these problems and for increasing the likelihood that significant effects actually reflect true differences among groups or conditions.

It can seem like a miracle when a predicted effect is found to be statistically significant in an event-related potential (ERP) experiment. A typical ERP effect may be only a millionth of a volt or a hundredth of a second, and these effects can easily be obscured by the many sources of biological and environmental noise that contaminate ERP data. Averaging together a large number of trials can improve reliability and statistical power, but even with an infinite number of trials there would still be variance due to factors such as mind wandering. All these sources of variance can make it very difficult for a 1 μV or 10 ms difference between groups or conditions to reach the .05 threshold for statistical significance. Consequently, when an experiment is designed to look for a small but specific effect, much care is needed to ensure that the predicted effect will be statistically significant if it is actually present.

On the other hand, it is extraordinarily easy to find statistically significant but unpredicted and unreplicable effects in ERP experiments. ERP data sets are so rich that random variations in the data have a good chance of producing statistically significant effects at some time points and at some electrode sites if enough analyses are conducted. These effects are bogus (i.e., not genuine), but it can be difficult for the researchers, the reviewers of a journal submission, or the readers of a published article to know whether a given effect is real or bogus. This likely leads to the publication of a large number of effects that are bogus but have the imprimatur of statistical significance. Estimating how often this happens is difficult, but there is growing evidence that many published results in psychology (Open Science Collaboration, 2015), neuroscience (Button, Ioannidis, Mokrysz, Nosek, Flint, Robinson, & Munafò, 2013), and oncology (Prinz, Schlange, & Asadullah, 2011) are not replicable. Many factors contribute to the lack of replicability, but one of them is what Simmons, Nelson, and Simonsohn (2011) called experimenter degrees of freedom.
This is the idea that experimenters can analyze their data in many different ways, and if the methods that experimenters choose are selected after the data have been viewed, this will dramatically increase the likelihood that bogus effects reach the criterion for statistical significance. Experimenters typically have more degrees of freedom in the analysis of ERP experiments than in the analysis of behavioral experiments, and this likely leads to the publication of many significant-but-bogus ERP findings. The purpose of the present paper is to demonstrate the ease of finding significant-but-bogus effects in ERP experiments and to provide some concrete suggestions for avoiding these bogus effects. Following the approach of Simmons et al. (2011), we will begin by showing how the data from an actual experiment can be analyzed inappropriately to produce significant-but-bogus effects. We will then discuss in detail a common practice—the use of analysis of variance with large numbers of factors—that can lead to significant-but-bogus effects in the vast majority of experiments. We will also provide concrete suggestions for avoiding findings that are statistically significant but are false and unreplicable.

How to Find Significant Effects in Any ERP Experiment: An Example

Our goal in this section is to show how a very reasonable-sounding analysis strategy can lead to completely bogus conclusions. To accomplish this, we took a subset of the data from an actual published ERP study (Luck, Kappenman, Fuller, Robinson, Summerfelt, & Gold, 2009) and performed new analyses that sound reasonable but were in fact inappropriate and led to completely bogus effects. Note that everything about the experiment and re-analysis will be described accurately, with the exception of one untrue feature of the design that will be revealed later and will make it clear that any significant results must be bogus in this re-analysis (see Luck, 2014 for a different framing of these results). Although the original study compared a patient group with a control group, the present re-analysis focuses solely on within-group effects from a subset of 12 control subjects.

Design and Summary of Findings

In this study, data were obtained from 12 healthy adults in the visual oddball task shown in Figure 1. Letters and digits were presented individually at the center of the video display, and participants were instructed to press with the left hand when a letter appeared and with the right hand when a digit appeared (or vice versa). The stimuli in a given block consisted of 80% letters and 20% digits (or vice versa), and the task required discriminating the category of the stimulus and ignoring the differences among the members of a given category (i.e., all letters required one response and all digits required the other response). However, unbeknownst to the participants, 20% of the stimuli in the frequent category were exact repetitions of the preceding stimulus (such as the repetition of the letter G in the example shown in Figure 1). The goal of the analyses presented here was to determine whether these frequent repetitions were detected, along with the time course of the differential processing of repetitions and non-repetitions. Previous research has shown that repetitions of the rare category can influence the P3 wave (Duncan-Johnson & Donchin, 1977; Johnson & Donchin, 1980), but the effects of repetitions on individual exemplars of the frequent category are not known.
The experiment included 800 trials per participant, with 640 stimuli in the frequent category and 160 stimuli in the rare category. Within the frequent category, there were 128 repetitions and 512 non-repetitions for each participant. As has been found many times before, the N2 and P3 waves were substantially larger for the rare category than for the frequent category (see Luck et al., 2009 for a comparison of these waveforms). The present analyses focused solely on the frequent category to determine whether there were any differences in the ERPs for frequent repetitions and frequent non-repetitions. Standard recording, filtering, artifact rejection, and averaging procedures were used (see Luck et al., 2009 for details).

Figure 2 shows the grand average waveforms for the frequent repetitions and the frequent non-repetitions. There were two clear differences in the waveforms between these trial types. First, repetitions elicited a larger P1 wave than non-repetitions, especially over the right hemisphere. Second, repetitions elicited a larger P2 wave at the central and parietal electrode sites. We performed standard analyses to determine whether these effects were statistically significant. In these analyses, P1 amplitude was quantified as the mean voltage between 50 and 150 ms poststimulus, and the amplitude measures were analyzed in a three-way analysis of variance (ANOVA) with factors of trial type (repetition versus non-repetition), electrode hemisphere (left versus right), and within-hemisphere electrode position (frontal pole, lateral frontal, mid frontal, central, or parietal). The effect of trial type was only marginally significant (p = .051), but the interaction between trial type and hemisphere was significant (p = .011). Because of the significant interaction, follow-up comparisons were performed in which the data from the left and right hemisphere sites were analyzed in separate ANOVAs. The effect of trial type was significant for the right hemisphere (p = .031) but not for the left hemisphere. These results are consistent with the observation that the P1 at right hemisphere electrode sites was larger for the repetitions than for the non-repetitions.

P2 amplitude was quantified as the mean voltage between 150 and 250 ms poststimulus at the central and parietal electrode sites, and the amplitude measures were again analyzed in an ANOVA with factors of trial type, electrode hemisphere, and within-hemisphere electrode position. The effect of trial type was significant (p = .026), supporting the observation that the P2 was larger for the repetitions than for the non-repetitions at the central and parietal electrodes. Together, the P1 and P2 results indicate that repetitions of specific exemplars of the frequent category are detected even when this is not required by the task. Further, these results indicate that the detection of the repetition occurs rapidly (within approximately 100 ms of stimulus onset) and also impacts later processing (ca. 200 ms).

One might be concerned that far fewer trials contributed to the averaged ERP waveforms for the repetitions than for the non-repetitions. However, this is not actually a problem given that mean amplitude, rather than peak amplitude, was used to quantify the components. Mean amplitude is an unbiased measure, which means that it is equally likely to be larger or smaller than the true value and is not more likely to produce consistently larger values in noisier waveforms (see Luck, 2014).
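To make this measurement-and-analysis pipeline concrete, the sketch below shows in Python how mean amplitude in the 50–150 ms window can be extracted for each subject, condition, and electrode and then submitted to a three-way repeated-measures ANOVA (trial type × hemisphere × within-hemisphere site). The array shapes, channel labels, sampling rate, and the use of statsmodels' AnovaRM are illustrative assumptions; the original analyses were not performed with this code.

```python
# Minimal sketch of the measurement-and-ANOVA pipeline described above.
# The array shapes, channel labels, and sampling rate are hypothetical;
# they are not taken from the actual recordings.
import numpy as np
import pandas as pd
from statsmodels.stats.anova import AnovaRM

fs = 500                                  # sampling rate in Hz (assumed)
t = np.arange(-100, 600, 1000 / fs)       # epoch time axis in ms (assumed)
n_subjects = 12

# erps[condition][subject, channel, time]: averaged waveforms per condition.
# Random placeholders stand in for the real subject averages.
channels = ['FP1', 'FP2', 'F7', 'F8', 'F3', 'F4', 'C3', 'C4', 'P3', 'P4']
erps = {cond: np.random.randn(n_subjects, len(channels), t.size)
        for cond in ['repetition', 'non-repetition']}

site_of = {'FP1': 'frontal pole', 'FP2': 'frontal pole',
           'F7': 'lateral frontal', 'F8': 'lateral frontal',
           'F3': 'mid frontal', 'F4': 'mid frontal',
           'C3': 'central', 'C4': 'central',
           'P3': 'parietal', 'P4': 'parietal'}
hemi_of = {ch: ('left' if ch.endswith(('1', '3', '7')) else 'right')
           for ch in channels}

# Quantify P1 as the mean voltage from 50 to 150 ms poststimulus.
win = (t >= 50) & (t <= 150)
rows = []
for cond, data in erps.items():
    mean_amp = data[:, :, win].mean(axis=2)          # subjects x channels
    for ci, ch in enumerate(channels):
        for s in range(n_subjects):
            rows.append({'subject': s, 'trial_type': cond,
                         'hemisphere': hemi_of[ch], 'site': site_of[ch],
                         'mean_amp': mean_amp[s, ci]})
df = pd.DataFrame(rows)

# Three-way repeated-measures ANOVA:
# trial type x hemisphere x within-hemisphere electrode position.
anova = AnovaRM(df, depvar='mean_amp', subject='subject',
                within=['trial_type', 'hemisphere', 'site']).fit()
print(anova)
```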
Actual Design and Bogus Results

Although the analyses we have presented follow common practices and might not be criticized in a journal submission, our analysis strategy was seriously flawed. The statistically significant effects were, in fact, solely a result of random noise in the data. Ordinarily, one cannot know whether a set of significant effects is real or bogus, but we know with complete certainty that any effects involving stimulus repetition in the present experiment were bogus, because there was not actually a manipulation of stimulus repetition. This was just a cover story to make the data analysis sound plausible. Instead of manipulating repetitions versus non-repetitions, we randomly sorted the frequent stimuli for each subject into a set of 512 trials that we arbitrarily labeled “non-repetitions” and 128 trials that we arbitrarily labeled “repetitions.” In other words, we simulated an experiment in which the null hypothesis was known to be true: The “repetition” and “non-repetition” trials were selected at random from the same population of trials. Thus, we know that the null hypothesis was true for any effects involving the “trial type” factor, and we know with certainty that the significant main effect of trial type for the P2 wave and the significant interaction between trial type and hemisphere for the P1 wave are bogus effects. This also means that our conclusions about the presence and timing of repetition effects are false.

The problem with our strategy for analyzing this data set is that we used the observed grand average waveforms to select the time period and electrode sites that were used in the analysis. Even though this was a relatively simple ERP experiment, there were so many opportunities for noise to create bogus differences between the waveforms that we were able to find time periods and electrode sites where these differences were statistically significant. There is nothing special about this experiment that allowed us to find bogus differences; almost any ERP experiment will yield such a rich data set that noise will lead to statistically significant effects if the choice of analysis parameters is based on the observed differences among the waveforms.

The Problem of Multiple Implicit Comparisons

This approach to data analysis leads to the problem of multiple implicit comparisons (Luck, 2014): the experimenter is implicitly making hundreds of comparisons between the observed waveforms by visually comparing them and performing explicit statistical comparisons only for the time regions and scalp regions in which the visual comparisons indicate that differences are present. To show how this problem arises, we re-analyzed the data from the aforementioned experiment in a way that makes the problem of multiple comparisons explicit. Specifically, we performed a separate t test at each individual time point and at each electrode site to compare the “repetition” and “non-repetition” waveforms. This yielded hundreds of individual p values, many of which indicated significant differences between the “repetition” and “non-repetition” waveforms. It is widely known that this strategy is inappropriate and leads to a high rate of false positives. If we had tried to publish the results with hundreds of individual t tests and no correction for multiple comparisons, any reasonable reviewer would have cited this as a major flaw and recommended rejection. Indeed, none of the differences remained significant when we applied a Bonferroni correction for multiple comparisons.
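For illustration, a minimal sketch of this point-by-point testing procedure, including a Bonferroni-corrected threshold, is shown below. The arrays are random placeholders rather than the actual data; with purely random data, roughly 5% of the uncorrected tests will appear "significant" simply by chance, and essentially none will survive the correction.

```python
# Minimal sketch of the pointwise ("mass") t-test procedure described above.
# The data arrays are hypothetical placeholders with the same layout as in
# the earlier sketch: erps[condition][subject, channel, time].
import numpy as np
from scipy.stats import ttest_rel

n_subjects, n_channels, n_times = 12, 10, 350
erps = {cond: np.random.randn(n_subjects, n_channels, n_times)
        for cond in ['repetition', 'non-repetition']}

# Paired t test at every (channel, time point) across subjects.
t_vals, p_vals = ttest_rel(erps['repetition'], erps['non-repetition'], axis=0)
# t_vals and p_vals have shape (n_channels, n_times): 3,500 tests in this sketch.

n_tests = p_vals.size
print(f'Uncorrected p < .05 at {np.sum(p_vals < .05)} of {n_tests} points')

# Bonferroni correction: compare each p value to .05 divided by the number of tests.
alpha_bonf = .05 / n_tests
print(f'Significant after Bonferroni correction: {np.sum(p_vals < alpha_bonf)}')
```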
Although it is widely understood that performing large numbers of explicit statistical comparisons leads to a high probability of bogus differences, it is less widely appreciated that researchers are implicitly conducting multiple comparisons when they use the observed ERP waveforms to guide their choice of explicit statistical comparisons. This is exactly what we did in the aforementioned experiment: We looked at the grand average waveforms, saw some differences, and decided to conduct statistical analyses of the P1 and P2 waves using specific time ranges and electrode sites that showed apparent differences between conditions. In other words, differences between the waveforms that were entirely due to noise led us to focus on specific time periods and electrode sites, and this biased us to find significant-but-bogus effects in a small number of explicit statistical analyses. Using the grand averages to guide the analyses in this manner leads to the same end result as performing hundreds of explicit t tests without a correction for multiple comparisons—namely a high rate of spurious findings—and yet it is very easy to “get away with” this approach when publishing ERP studies. Similar issues arise in fMRI research (Vul, Harris, Winkielman, & Pashler, 2009).

If we had submitted a paper with the small set of ANOVA results described earlier, we could have told a reasonably convincing story about why we expected that the P1 and P2 waveforms would be sensitive to the detection of task-irrelevant stimulus repetitions (which is known as hypothesizing after the results are known, or HARKing; see Kerr, 1998). Moreover, we could have concluded that repetitions are detected as early as 100 ms poststimulus, and it is plausible that the paper would have been accepted for publication. Thus, completely bogus differences that are a result of random variation could easily lead to effects that are convincing and possibly publishable, especially if they are described as “predicted results” rather than post hoc findings. Consequently, an unethical researcher who wishes to obtain “publishable effects” in a given experiment regardless of whether the results are real would be advised to look at the grand averages, find time ranges and electrode sites for which the conditions differ, measure the effects at those time ranges and electrode sites, report the statistical analyses for those measurements, and describe the data as fitting the predictions of a “hypothesis” that was actually developed after the waveforms were observed. However, a researcher who wants to avoid significant-but-bogus effects would be advised to focus on testing a priori predictions without using the observed data to guide the selection of time windows or electrode sites, treating any other effects (even if highly significant) as merely suggestive until replicated.

How to Avoid Biased Measurement and Analysis Procedures

In this section, we will describe several approaches that can be taken to avoid the bogus findings that are likely to occur if the grand average waveforms are used to guide the measurement and analysis procedures (see Luck, 2014 for additional discussion).
First, however, we would like to note that there is a tension between the short-term desire of individual researchers to publish papers and the long-term desire of the field as a whole to minimize the number of significant-but-bogus findings in the literature. Most approaches for reducing the Type I error rate will simultaneously decrease the number of significant and therefore publishable results. Moreover, many of these approaches will decrease statistical power, meaning that the rate of Type II errors (false negatives) will increase as the rate of Type I errors (false positives) decreases. It is therefore unrealistic to expect individual researchers—especially those who are early in their careers—to voluntarily adopt data analysis practices that are good for the field but make it difficult for them to get their own papers published. The responsibility therefore falls mainly on journal editors and reviewers to uniformly enforce practices that will minimize the Type I error rate. The need for editors and reviewers to enforce these practices is one of the key points of the recently revised publication guidelines of the Society for Psychophysiological Research (Keil, Debener, Gratton, Junghöfer, Kappenman, Luck, Luu, Miller, & Yee, 2014).

A Priori Measurement Parameters

When possible, the best way to avoid biasing ERP component measurement procedures toward significant-but-bogus effects is to define the measurement windows and electrode sites before seeing the data. However, this is not always possible. For example, the latency of an effect may vary across studies as a function of low-level sensory factors, such as stimulus luminance and discriminability, making the measurement windows from previous studies inappropriate for a new experiment. Also, many studies are sufficiently novel that prior studies with similar methods are not available to guide the analysis parameters. There are several alternative approaches for these cases.

Functional Localizers

One approach, which is popular in neuroimaging research, is to use a “functional localizer” condition to determine the time window and electrode sites for a given effect. For example, an experiment that uses the N170 component to examine some subtle aspect of face processing (e.g., male faces versus female faces) could include a simple face-versus-nonface condition; the timing and scalp distribution of the N170 from the face-versus-nonface condition could then be used for quantifying N170 amplitude in the more subtle conditions. An advantage of this approach is that it can take into account subject-to-subject differences in latency and scalp distribution, which might make it more sensitive than the usual one-size-fits-all approach. A disadvantage, however, is that it assumes that the timing and scalp distribution of the effect in the functional localizer condition are the same as in the conditions of interest, which may not be true (see Friston, Rotshtein, Geng, Sterzer, & Henson, 2006 for a description of the potential shortcomings of this approach in neuroimaging).

Collapsed Localizers

A related approach, which is becoming increasingly common, is to use a collapsed localizer. In this approach, the researcher simply averages the waveforms across the conditions that will ultimately be compared and then uses the timing and scalp distribution from the collapsed waveforms to define the analysis parameters that will be used for the non-collapsed data. For example, in an experiment designed to assess the N400 in two different conditions, the data could first be averaged across those two conditions, and then the time range and electrode sites showing the largest N400 activity could be used when measuring the N400 in the two conditions separately. There may be situations in which this approach would be problematic (see Luck, 2014), but it is often the best approach when the analysis parameters cannot be set on the basis of prior research.
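To make the collapsed-localizer logic concrete, the sketch below averages two hypothetical condition waveforms, defines the measurement window from the collapsed waveform, and only then measures each condition separately. The window-selection rule (a 100-ms window centered on the collapsed peak) and all variable names are illustrative assumptions rather than a prescription from this paper.

```python
# Minimal sketch of a collapsed localizer, using hypothetical data.
# cond_a and cond_b: [subject, time] averaged waveforms at one electrode
# site for the two conditions that will later be compared.
import numpy as np

fs = 500                                   # sampling rate in Hz (assumed)
t = np.arange(-100, 800, 1000 / fs)        # epoch time axis in ms (assumed)
n_subjects = 12
cond_a = np.random.randn(n_subjects, t.size)
cond_b = np.random.randn(n_subjects, t.size)

# Step 1: collapse across the conditions that will later be compared.
collapsed = (cond_a + cond_b) / 2          # subject-level collapsed waveforms
grand_collapsed = collapsed.mean(axis=0)   # collapsed grand average

# Step 2: define the measurement window from the collapsed waveform only.
# Here: a 100-ms window centered on the most negative point between
# 300 and 500 ms (a plausible search range for an N400-like component).
search = (t >= 300) & (t <= 500)
peak_ms = t[search][np.argmin(grand_collapsed[search])]
window = (t >= peak_ms - 50) & (t <= peak_ms + 50)

# Step 3: measure each condition separately using that fixed window.
mean_amp_a = cond_a[:, window].mean(axis=1)   # one value per subject
mean_amp_b = cond_b[:, window].mean(axis=1)
print(f'Window: {peak_ms - 50:.0f} to {peak_ms + 50:.0f} ms')
print('Condition difference (a - b):', (mean_amp_a - mean_amp_b).mean())
```

Because the window is chosen from data to which the two conditions contribute equally, the window choice cannot, by itself, push the subsequent comparison toward one condition or the other.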
Window-Independent Measures

Some methods for quantifying ERP amplitudes and latencies are highly dependent on the chosen time window, and other methods are relatively independent (see Luck, 2014). For example, mean amplitude can vary a great deal depending on the measurement window, whereas peak amplitude is less dependent on the precise window, especially when the largest peak in the waveform is being measured. Mean amplitude is typically superior to peak amplitude in other ways, however, such as its lower sensitivity to high-frequency noise (Clayson, Baldwin, & Larson, 2013; Luck, 2014). Nonetheless, it may be appropriate to use peak amplitude when there is no good way of determining the measurement window for measuring mean amplitude. Another approach is to show that the statistical significance of a mean amplitude effect does not depend on the specific measurement window (see, e.g., Bacigalupo & Luck, 2015).

The Mass Univariate Approach

Another approach is the mass univariate approach, in which a separate t test (or related statistic) is computed at every time point for every electrode site and some kind of correction for multiple comparisons is applied to control the overall Type I error rate. The traditional Bonferroni correction is usually unreasonably conservative, but a variety of other correction methods are now available (Groppe, Urbach, & Kutas, 2011a; Maris & Oostenveld, 2007) and are implemented in free, open-source analysis packages, such as the Mass Univariate Toolbox (Groppe, Urbach, & Kutas, 2011b) and FieldTrip (Oostenveld, Fries, Maris, & Schoffelen, 2011). These approaches are still fairly conservative, but they may be the best option when no a priori information is available to guide the choice of latency windows and electrode sites (a sketch of one such permutation-based correction appears at the end of this section).

Mathematical Isolation of the Latent Components

Another approach is to use a method that attempts to mathematically isolate the underlying latent ERP components. For example, techniques such as source localization, independent component analysis, and spatial principal component analysis attempt to quantify the magnitude of the underlying component at each time point, eliminating the need to select specific electrode sites for analysis. Similarly, temporal principal component analysis attempts to quantify the magnitude of the underlying component for each trial type, eliminating the need to select specific time windows for analysis.
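As promised above, the following sketch illustrates the mass univariate approach with a max-statistic permutation correction, a self-contained stand-in for the corrections available in the toolboxes cited above. The data layout, permutation count, and variable names are assumptions made for illustration; they do not describe the analyses in the cited packages or in this paper.

```python
# Minimal sketch of a mass univariate analysis with a max-statistic
# permutation correction. The data are hypothetical placeholders:
# erps[condition][subject, channel, time], as in the earlier sketches.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_subjects, n_channels, n_times = 12, 10, 350
erps = {cond: rng.standard_normal((n_subjects, n_channels, n_times))
        for cond in ['repetition', 'non-repetition']}

# Observed paired t statistic at every (channel, time point).
diff = erps['repetition'] - erps['non-repetition']    # subject difference waves
t_obs, _ = ttest_rel(erps['repetition'], erps['non-repetition'], axis=0)

# Permutation distribution of the maximum |t| across all points.
# Under the null hypothesis, the sign of each subject's difference wave is
# arbitrary, so we randomly flip signs and recompute the t map each time.
n_perm = 1000
max_t = np.empty(n_perm)
for i in range(n_perm):
    signs = rng.choice([-1, 1], size=n_subjects)[:, None, None]
    d = diff * signs
    t_perm = d.mean(axis=0) / (d.std(axis=0, ddof=1) / np.sqrt(n_subjects))
    max_t[i] = np.abs(t_perm).max()

# Any point whose observed |t| exceeds the 95th percentile of the max-|t|
# distribution is significant with familywise error controlled at .05.
t_crit = np.percentile(max_t, 95)
sig = np.abs(t_obs) > t_crit
print(f'Critical |t| = {t_crit:.2f}; significant points: {sig.sum()} of {sig.size}')
```

With purely random placeholder data such as these, essentially no points survive the corrected threshold, whereas many would pass an uncorrected .05 criterion.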
